Dirichlet process mixture models to impute missing predictor data in counterfactual prediction models: an application to predict optimal type 2 diabetes therapy

Background: The handling of missing data is a challenge for inference and regression modelling. A particular difficulty is missing predictor information, especially when building, and making predictions from, models intended for use in clinical practice.

Methods: We utilise a flexible Bayesian approach for handling missing predictor information in regression models. This provides practitioners with full posterior predictive distributions for both the missing predictor information (conditional on the observed predictors) and the outcome-of-interest. We apply this approach to a previously proposed counterfactual treatment selection model for type 2 diabetes second-line therapies. Our approach combines a regression model and a Dirichlet process mixture model (DPMM), where the former defines the treatment selection model and the latter provides a flexible way to model the joint distribution of the predictors.

Results: We show that DPMMs can model complex relationships between predictor variables and provide a powerful means of fitting models to incomplete data (under missing-completely-at-random and missing-at-random assumptions). Because of the hierarchical model structure, the posterior distributions for the parameters and for the conditional average treatment effect estimates automatically reflect the additional uncertainty associated with the missing data. We also demonstrate that, in the presence of multiple missing predictors, the DPMM can be used to explore which variable(s), if collected, could provide the most additional information about the likely outcome.

Conclusions: When developing clinical prediction models, DPMMs offer a flexible way to model complex covariate structures and to handle missing predictor information. DPMM-based counterfactual prediction models can also provide additional information to support clinical decision-making, including allowing predictions with appropriate uncertainty to be made for individuals with incomplete predictor data.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12911-023-02400-3.


S1.1 Component distributions
As described in Box 1, for $J_C$ continuous predictors we use mixtures of multivariate Gaussian distributions with $J_C$ dimensions, with cluster-specific parameters for component $k$ ($k = 1, \ldots, K$) given by $(\mu_k, \Sigma_k)$, where $\mu_k$ is a $J_C$-vector of means and $\Sigma_k$ is a $(J_C \times J_C)$ covariance matrix. We use a latent variable $Z_i$ to denote the cluster membership for individual $i$, and hence the conditional component density given $Z_i$ is:

$$f\left(x_i^C \mid Z_i = k\right) = \mathcal{N}\left(x_i^C \mid \mu_k, \Sigma_k\right).$$

For $J_D$ categorical predictors, we use mixtures of categorical probability mass functions, where the number of categories for covariate $j$ ($j = 1, \ldots, J_D$) is $K_j$ and the component-specific parameters are the probabilities of belonging to each category, given by $\phi_k = (\phi_{k1}, \phi_{k2}, \ldots, \phi_{kJ_D})$ with $\phi_{kj} = (\phi_{kj1}, \phi_{kj2}, \ldots, \phi_{kjK_j})$ and $\sum_{l=1}^{K_j} \phi_{kjl} = 1$. Hence

$$f\left(x_i^D \mid Z_i = k\right) = \prod_{j=1}^{J_D} \prod_{l=1}^{K_j} \phi_{kjl}^{\,\mathbb{I}\left(x_{ij}^D = l\right)}.$$
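As a concrete illustration of these component densities, the following minimal sketch (our own illustration, not the authors' supplied code; names and toy values are hypothetical) evaluates the density of a single component for an individual with two continuous and one three-level categorical covariate:

```python
import numpy as np
from scipy.stats import multivariate_normal

def component_density(x_cont, x_cat, mu_k, Sigma_k, phi_k):
    """Density of one DPMM component k: a multivariate Gaussian over the
    continuous predictors times independent categorical pmfs over the
    discrete predictors (conditional on cluster membership Z_i = k)."""
    # Gaussian part for the J_C continuous predictors
    dens = multivariate_normal.pdf(x_cont, mean=mu_k, cov=Sigma_k)
    # Categorical part: phi_k[j][l] = P(covariate j takes level l | Z_i = k)
    for j, level in enumerate(x_cat):
        dens *= phi_k[j][level]
    return dens

# Toy example: J_C = 2 continuous and J_D = 1 three-level categorical covariate
mu_k = np.zeros(2)
Sigma_k = np.eye(2)
phi_k = [np.array([0.5, 0.3, 0.2])]
print(component_density(np.array([0.1, -0.4]), [1], mu_k, Sigma_k, phi_k))
```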

S1.2 Prior distributions
We use a stick-breaking [1, 2] construction for the prior probabilities of the mixture components $p_k$. Conceptually, this involves repeatedly breaking off, and setting aside, a random fraction of a stick of initial length 1. The fraction broken off at each step is sampled from a Beta distribution with shape parameter $\alpha$ (where $\alpha$ influences the prior weights on the number of components).
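For intuition, here is a minimal sketch of a truncated stick-breaking draw, assuming the standard construction with Beta$(1, \alpha)$ break fractions; this is an illustration only, not the sampler used to fit the model:

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking_weights(alpha, K):
    """Truncated stick-breaking draw of prior mixture weights p_1..p_K.
    V_k ~ Beta(1, alpha); p_k is the fraction broken off what remains."""
    v = rng.beta(1.0, alpha, size=K)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    p = v * remaining
    p[-1] = 1.0 - p[:-1].sum()  # assign the leftover stick to the last component
    return p

# Small alpha concentrates prior mass on few components; large alpha spreads it
print(stick_breaking_weights(alpha=1.0, K=10))
print(stick_breaking_weights(alpha=10.0, K=10))
```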
In order to improve convergence and mixing of the MCMC algorithm, all continuous variables are standardised. The regression parameters ($\beta$) are given weakly informative shrinkage priors, which act as a regularisation technique and can reduce posterior uncertainty while stabilising computation [3]; a weakly informative prior is also placed on $\sigma$. The clustering and components of the DPMM follow a hierarchical structure, with prior distributions

$$\mu_k \sim \mathcal{N}\left(\mu_0, \Sigma_0\right), \qquad \Sigma_k \sim \text{Inverse-Wishart}\left(R, \rho_0\right), \qquad \alpha \sim \text{Gamma}\left(\text{shape} = 2, \text{rate} = 1\right).$$

We set $\mu_0$ to the vector of means of the continuous variables, and we set $\Sigma_0$ to a diagonal covariance matrix whose diagonal entries equal the magnitude of the range of each continuous covariate. For $R$, we set the degrees of freedom to the number of continuous variables in the model, $\rho_0 = 6$. Although we achieve a good fit with $\alpha \sim \text{Gamma}(\text{shape} = 2, \text{rate} = 1)$, this may not hold in other applied settings, and other ways of defining this hyperparameter could be considered. See Liverani et al., 2015 for a justification of these choices [4].

S1.3 Predictive distributions
The Bayesian hierarchical model has posterior distribution

$$\pi\left(\beta, \sigma, \Theta \mid Y, X\right) \propto f\left(Y \mid X, \beta, \sigma\right) f\left(X \mid \Theta\right) \pi\left(\beta\right) \pi\left(\sigma\right) \pi\left(\Theta\right), \qquad \text{(S.3)}$$

where $\Theta$ denotes the collected DPMM parameters. If there are no missing predictor variables, then the DPMM provides no additional information about the treatment selection model parameters, and could be integrated out of (S.3). The purpose of including the DPMM is two-fold: a) to allow incomplete predictor data to be included when the model is fitted; and b) to allow predictions to be made for individuals with missing predictor information. Hence, if we have a combination of missing ($X^m$) and non-missing ($X^o$) predictors, then the target posterior distribution is

$$\pi\left(\beta, \sigma, \Theta, X^m \mid Y, X^o\right) \propto f\left(Y \mid X^o, X^m, \beta, \sigma\right) f\left(X^o, X^m \mid \Theta\right) \pi\left(\beta\right) \pi\left(\sigma\right) \pi\left(\Theta\right),$$

and as such we can obtain full posterior predictive distributions for all of the missing variables $X^m$ in addition to the model parameters. We can then integrate over the missing variables to get the marginal posterior distribution for the parameters-of-interest:

$$\pi\left(\beta, \sigma, \Theta \mid Y, X^o\right) = \int \pi\left(\beta, \sigma, \Theta, X^m \mid Y, X^o\right) \mathrm{d}X^m.$$

This (multi-dimensional) integral can be evaluated numerically using the posterior samples generated from the MCMC. Thus the Bayesian model naturally propagates the uncertainties from the missing information through to the posterior distributions for the parameters.
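To make the marginalisation step concrete, the following sketch (illustrative names and simulated stand-in draws, not the fitted model output) shows that, with joint MCMC samples in hand, integrating over $X^m$ amounts to simply ignoring the imputed-covariate columns of the joint draws:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for joint posterior draws from the MCMC (illustrative only):
# each row is one draw of (beta, sigma, imputed missing predictors X_m).
n_draws = 4000
draws = {
    "beta":  rng.normal(size=(n_draws, 3)),
    "sigma": np.abs(rng.normal(size=n_draws)),
    "X_m":   rng.normal(size=(n_draws, 2)),
}

# Marginalising the joint posterior over X_m is automatic with Monte Carlo:
# keep the (beta, sigma) columns of the joint draws and simply drop X_m.
beta_marginal = draws["beta"]  # samples from the beta margin of pi(. | Y, X_o)
print(beta_marginal.mean(axis=0), beta_marginal.std(axis=0))
```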
Similarly, since the DPMM provides a flexible joint probability model for the predictor variables, we can leverage this to produce posterior predictive distributions for a new individual with observed predictors $X^{o*}$, say. In this case

$$f\left(Y^* \mid X^{o*}, Y, X\right) = \int f\left(Y^* \mid X^{o*}, X^{m*}, \beta, \sigma\right) f\left(X^{m*} \mid X^{o*}, \Theta\right) \pi\left(\beta, \sigma, \Theta \mid Y, X\right) \mathrm{d}X^{m*}\, \mathrm{d}\beta\, \mathrm{d}\sigma\, \mathrm{d}\Theta,$$

and the DPMM parameters integrate out analytically. Following Dennis et al. [5], the plots presented in this paper ignore the residual variation and present posterior distributions for the expected response. See Box 2 for more information. Full details are given in Section S1.4, and example code for fitting the model and generating posterior predictive samples is given as additional supplementary material.

S1.4 Conditional predictive sampling
Posterior predictive distributions for patients with missing information can be generated empirically using posterior samples from the MCMC, as long as we can sample from a conditional DPMM and from the treatment selection model (both described in Section S1.3). Suppose $X^*$ is a $J$-dimensional vector of covariates for a new individual. We partition $X^*$ into two disjoint subsets $X^{m*}$ and $X^{o*}$, where $X^{m*}$ are the missing covariates of dimension $M$ and $X^{o*}$ are the observed covariates of dimension $J - M$. We will estimate the joint posterior predictive distribution for the response $Y^*$ and $X^{m*}$ given the observations $X^{o*}$:

$$f\left(Y^*, X^{m*} \mid X^{o*}, Y, X\right) = \int f\left(Y^* \mid X^{m*}, X^{o*}, \beta, \sigma\right) f\left(X^{m*} \mid X^{o*}, \Theta\right) \pi\left(\beta, \sigma, \Theta \mid Y, X\right) \mathrm{d}\beta\, \mathrm{d}\sigma\, \mathrm{d}\Theta,$$

with, conditional on a cluster membership $Z = z$ sampled from the DPMM given the observed covariates, the missing continuous covariates drawn from the conditional multivariate Gaussian

$$X^{Cm*} \mid X^{Co*} = x^{Co*}, Z = z \;\sim\; \mathcal{N}\left(\mu^{Cm}_z + \Sigma^{Cmo}_z \left(\Sigma^{Co}_z\right)^{-1}\left(x^{Co*} - \mu^{Co}_z\right),\; \Sigma^{Cm}_z - \Sigma^{Cmo}_z \left(\Sigma^{Co}_z\right)^{-1}\Sigma^{Com}_z\right),$$

assuming that $\mu^C_z$ and $\Sigma^C_z$ are partitioned such that

$$\mu^C_z = \begin{pmatrix} \mu^{Cm}_z \\ \mu^{Co}_z \end{pmatrix}, \qquad \Sigma^C_z = \begin{pmatrix} \Sigma^{Cm}_z & \Sigma^{Cmo}_z \\ \Sigma^{Com}_z & \Sigma^{Co}_z \end{pmatrix}.$$

Finally, we sample the missing categorical covariates independently for $j = 1, \ldots, M$ (since the $X^{Dm*}_j$ are conditionally independent of $X^{Do*}_j$ given cluster $Z = z$).
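To make the conditioning step concrete, here is a minimal sketch (our own illustration with hypothetical names and toy values, not the authors' supplied code) of sampling the missing continuous covariates given the observed ones, for a single posterior draw of one cluster's parameters:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_missing_given_observed(mu_z, Sigma_z, x_obs, m_idx, o_idx, n_samples=1):
    """Draw X^{Cm*} | X^{Co*} = x_obs, Z = z from the conditional Gaussian.
    m_idx / o_idx index the missing / observed continuous covariates."""
    mu_m, mu_o = mu_z[m_idx], mu_z[o_idx]
    S_mm = Sigma_z[np.ix_(m_idx, m_idx)]
    S_mo = Sigma_z[np.ix_(m_idx, o_idx)]
    S_oo = Sigma_z[np.ix_(o_idx, o_idx)]
    # Conditional mean and covariance of the partitioned Gaussian
    W = S_mo @ np.linalg.inv(S_oo)
    cond_mean = mu_m + W @ (x_obs - mu_o)
    cond_cov = S_mm - W @ S_mo.T
    return rng.multivariate_normal(cond_mean, cond_cov, size=n_samples)

# Toy example: 3 continuous covariates; covariate 0 missing, 1 and 2 observed
mu_z = np.array([0.0, 1.0, -1.0])
Sigma_z = np.array([[1.0, 0.5, 0.2],
                    [0.5, 1.0, 0.1],
                    [0.2, 0.1, 1.0]])
print(sample_missing_given_observed(mu_z, Sigma_z,
                                    x_obs=np.array([1.2, -0.7]),
                                    m_idx=[0], o_idx=[1, 2], n_samples=5))
```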
Once we have $L$ random samples $X^{m*}_l$ ($l = 1, \ldots, L$), we can then sample (with a slight abuse of notation) from the treatment selection model, conditional on each completed covariate vector $(X^{m*}_l, X^{o*})$, to obtain posterior predictive samples for the response $Y^*$.
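A minimal sketch of this final step follows, assuming a simple linear treatment selection model; the design-matrix construction and names here are illustrative, not the published model specification:

```python
import numpy as np

rng = np.random.default_rng(7)

def predict_response(x_complete, beta, sigma, expected_only=True):
    """Sample Y* from a linear model given one completed covariate vector
    and one posterior draw of (beta, sigma)."""
    mu = x_complete @ beta
    # Following the paper's plots, we can report the expected response only,
    # ignoring residual variation; set expected_only=False to add it back.
    return mu if expected_only else rng.normal(mu, sigma)

# Chain the two sampling steps: for each posterior draw l, pair the imputed
# covariates X^{m*}_l with the observed ones and push through the model.
L, J = 1000, 3
beta_draws = rng.normal(size=(L, J))           # illustrative posterior draws
sigma_draws = np.abs(rng.normal(1.0, 0.1, L))
x_obs = np.array([0.5, -1.0])                  # observed covariates
x_mis_draws = rng.normal(size=L)               # imputed covariate draws
y_star = np.array([
    predict_response(np.concatenate([[x_mis_draws[l]], x_obs]),
                     beta_draws[l], sigma_draws[l])
    for l in range(L)
])
print(np.percentile(y_star, [2.5, 50, 97.5]))  # posterior predictive summary
```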

S1.5 Convergence diagnostics
The Bayesian model was run for 50,000 iterations in two separate chains. Inspection of the $\alpha$ values (Figure S2C), as well as $\sigma$ and the regression parameters (Figure S3), indicated that the first 20,000 iterations should be discarded as burn-in. The trace plots for the regression parameters over the remaining iterations suggest convergence of the model (Figure S3). During the model fit, not all components had patients assigned to them. After removing burn-in and ranking components by occupancy (Figure S2B), only 18 components were utilised, with components below the 14th rank having fewer than 30 patients (0.2%) assigned to them (Figure S2A). The Gelman-Rubin $\hat{R}$ values for $\alpha$, $\sigma$ and the regression parameters vary between 1 and 1.005.
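For reference, here is a minimal sketch of the classic (non-split) Gelman-Rubin statistic for a single parameter across two chains; this is our own illustration, not the diagnostic code used for the paper:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for one parameter,
    given an (n_chains, n_iterations) array of post-burn-in draws."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

# Two chains of 30,000 post-burn-in draws for a single parameter
rng = np.random.default_rng(0)
chains = rng.normal(size=(2, 30000))
print(gelman_rubin(chains))  # values near 1 indicate convergence
```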

S1.6 The Bayesian treatment selection model is consistent with the original penalised maximum likelihood regression model
The posterior distributions for the regression parameters in the model fitted to the incomplete data are consistent with those from the equivalent model fitted to the complete data under both the Bayesian and maximum likelihood approaches (Figure S5). The posterior credible intervals for the Bayesian model fitted to the complete data are, on the whole, slightly narrower than the frequentist confidence intervals, due to the weak shrinkage priors used. The inclusion of the incomplete data results in a further narrowing of the posterior intervals, due to the increase in the number of data points available to inform the model fit.
Internal validation of the model shows that the final model explained 29% of the variation in the HbA1c outcome, with good calibration (slope = 1.0015 [1 = perfect]) (Figure S6A). Validation in the holdout dataset shows that the model explained 29% of the variation in the HbA1c outcome, again with good calibration (slope = 1.02) (Figure S6B). In the development dataset, 13,368 patients are predicted to benefit more from SGLT2i therapy, and 2,758 patients are predicted to benefit more from DPP4i therapy. Where SGLT2i is predicted as the optimal therapy, 176 patients are predicted an average benefit >10 mmol/mol and 6,355 patients are predicted an average benefit of 5-10 mmol/mol; where DPP4i is predicted as the optimal therapy, 316 patients are predicted an average benefit >5 mmol/mol (Figure S7A). In the validation dataset, 8,929 patients are predicted to benefit more from SGLT2i therapy, and 1,822 patients are predicted to benefit more from DPP4i therapy. Where SGLT2i is predicted as the optimal therapy, 123 patients are predicted an average benefit >10 mmol/mol and 4,198 patients are predicted an average benefit of 5-10 mmol/mol; where DPP4i is predicted as the optimal therapy, 209 patients are predicted an average benefit >5 mmol/mol (Figure S7B).

Supplementary Figure S5: Caterpillar plot comparing Bayesian posterior samples fitted to the complete and incomplete datasets against the frequentist coefficient estimates. For each set of Bayesian posterior samples, the plot shows the 2.5%, 5%, 50%, 95% and 97.5% quantiles. For the frequentist estimates, the plot displays the 95% confidence interval.

S2 Supplementary Tables
Supplementary Figure S3: Trace plots of regression parameters for two chains, each with 30,000 iterations (20,000 iterations of burn-in removed).

Table S1: Bayesian posterior samples for the model parameters fitted to the incomplete datasets.